skip to main content


Search for: All records

Creators/Authors contains: "Harrison, Robert"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchips were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real application-based benchmark includes AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, ARM CPUs and several x86 with NVIDIA GPU systems. A brief energy efficiency estimate was performed based on TDP values. We found that in HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows NVIDIA Grace CPU Superchip to have per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%)). In scientific applications, the NVIDIA Grace CPU Superchip performance is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117% faster) than any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip Superchip are high-performance and most likely energy-efficient solutions for HPC centers. 
    more » « less
  2. Free, publicly-accessible full text available July 23, 2024
  3. Free, publicly-accessible full text available July 23, 2024
  4. Patterns in foliar nitrogen (N) stable isotope ratios (δ15N) have been shown to reveal trends in terrestrial N cycles, including the identification of ecosystems where N deficiencies limit forest ecosystem productivity. However, there is a gap in our understanding of within-species variation and species-level response to environmental gradients or forest management. Our objective is to examine the relationship between site index, foliar %N, foliar δ15N and spectral reflectance for managed Douglas-fir (Pseudotsuga menziesii) and loblolly pine (Pinus taeda) plantations across their geographic ranges in the Pacific Northwest and the southeastern United States, respectively. Foliage was measured at 28 sites for reflectance using a handheld spectroradiometer, and further analyzed for δ15N and N concentration. Unlike the prior work for grasslands and shrubland species, our results show that foliar δ15N and foliar %N are not well correlated for these tree species. However, multiple linear regression models suggest a strong predictive ability of spectroscopy data to quantify foliar δ15N, with some models explaining more than 65% of the variance in the δ15N. Additionally, moderate to strong explanations of variance were found between site index and foliar δ15N (R2 = 0.49) and reflectance and site index (R2 = 0.84) in the Douglas-fir data set. The development of relationships between foliar spectral reflectance, δ15N and measures of site productivity provides the first step toward mapping canopy δ15N for these managed forests with remote sensing. 
    more » « less
  5. We present and evaluate TTG, a novel programming model and its C++ implementation that by marrying the ideas of control and data flowgraph programming supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared memory parallel environments; a few support distributed memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed memory execution over 2 different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to the state-of-the-art implementations. 
    more » « less
  6. null (Ed.)
    On shared-memory multicore machines, classic two-way recursive divide-and-conquer algorithms are implemented using common fork-join based parallel programming paradigms such as Intel Cilk+ or OpenMP. However, in such parallel paradigms, the use of joins for synchronization may lead to artificial dependencies among function calls which are not implied by the underlying DP recurrence. These artificial dependencies can increase the span asymptotically and thus reduce parallelism. From a practical perspective, they can lead to resource underutilization, i.e., threads becoming idle. To eliminate such artificial dependencies, task-based runtime systems and data-flow parallel paradigms, such as Concurrent Collections (CnC), PaRSEC, and Legion have been introduced. Such parallel paradigms and runtime systems overcome the limitations of fork-join parallelism by specifying data dependencies at a finer granularity and allowing tasks to execute as soon as dependencies are satisfied.In this paper, we investigate how the performance of data-flow implementations of recursive divide-and-conquer based DP algorithms compare with fork-join implementations. We have designed and implemented data-flow versions of DP algorithms in Intel CnC and compared the performance with fork-join based implementations in OpenMP. Considering different execution parameters (e.g., algorithmic properties such as recursive base size as well as machine configuration such as the number of physical cores, etc), our results confirm that a data-flow based implementation outperforms its fork-join based counter-part when due to artificial dependencies, the fork-join implementation fails to generate enough subtasks to keep all processors busy and does not have enough data locality to compensate for the lost performance. This phenomena happens when the input size of the DP algorithm is small or we have a huge number of compute cores in the system. As a result, with a fixed computation resource, moving from small input to larger input, fork-join implementation of DP algorithms outperforms the corresponding data-flow implementation. However, for a fixed size problem, moving the computation to a compute node with a larger number of cores, data-flow implementation outperforms the corresponding fork-join implementation. 
    more » « less